Care and Feeding of Catalyst Optimizer


This talk is about the Spark SQL Catalyst optimizer. More than once I've realized I was trying to fix a slow Spark application at the wrong end: the execution wasn't the problem, the query planning was. I work in the AI Group at Bloomberg, where we run large Spark clusters in the cloud, and that's where these war stories come from.
One of the best things about Spark SQL is how low the barrier to entry is: schema inference is very simple, and if you know SQL, you can be productive right away. But between the SQL you write and the work that actually runs sits Catalyst, which turns your SQL into something that can execute in a distributed fashion across a cluster. That translation is what takes you from a query to a job running on the cluster, and it can take time to come back.
As a running example, I'll use a query over tree sites in New York City: which species are planted on which streets? As soon as this query runs, Catalyst goes to work. You can see the resulting query plan in the Spark UI. There are some really exciting options in there, and while a wall of query-plan text is probably not the most exciting part of this presentation, trust me, it's worth learning to read.
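As a hedged sketch (the toy DataFrame and column names here are my own, not the actual NYC tree-census schema), this is how you can print those plans yourself with PySpark's `explain`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("plans").getOrCreate()

# Toy stand-in for the street-tree data set; the real schema differs.
trees = spark.createDataFrame(
    [("London plane", "Broadway"), ("Pin oak", "Broadway"), ("Pin oak", "5th Ave")],
    ["species", "street"],
)

counts = trees.groupBy("street").count().orderBy("count", ascending=False)

# extended=True prints the parsed, analyzed, and optimized logical plans
# plus the physical plan -- the same phases walked through below.
counts.explain(extended=True)
```

The same information is available in the SQL tab of the Spark UI for any completed query.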
The query gets turned into something executable in several phases. See the color highlighting here? In the parsed logical plan you can pick out the three filters, okay? The analyzed logical plan may only be a little bit interesting, but it resolves what the projection in this query actually refers to. Then the interesting thing happens: optimization. Catalyst can apply various rules to the plan, and with cost-based optimization it can even decide how to order the joins. Finally, physical planning makes a whole bunch of decisions about the execution of the query, not one preconceived decision: which tasks run, and which operators are grouped together in one stage. After logical optimization the plan looks a bit different, right? See the filter underneath the projection? Something interesting has happened here. And see the asterisk highlighted in red? That is what we were talking about: operators that can be fused together by whole-stage code generation get compiled into one logical unit, and that fusion is a big part of why Spark SQL is as fast as it is.
Underneath all of this, the physical plan becomes a directed acyclic graph of RDDs. If you're interested in the details of this, there are references at the end of the presentation that go into very great detail. The plan has some structure of its own: there is a partial count on the executors, then a final count, and a TakeOrderedAndProject that's gonna run back on the driver. Without the highlighting, it's just a big wall of text, right? But once you spot the WholeStageCodegen stages, you can see, for example, where the filter now sits and how the stages correlate with each other.
So what does it look like when nothing happens? Because that does happen in production: you launch a job out on a large Spark cluster and the cluster is idle, hung, whatever. This is when you need to be persistent. The smoking gun is not going to show up on your first thread dump, okay? It takes persistence, timing, and luck. In our case, repeated thread dumps of the driver eventually led us back to the problem.
The workload was full of highly repetitive calculations. The developers involved were running essentially the same computation with different column names for different time periods, then joining the results back together, doing something with a full outer join. The query plan expands out in this kind of Cartesian way, and that is exactly the wrong way to do it, because the query optimizer has to re-analyze that repetitive structure over and over. The remedy started with analysis of the query plan itself: look at the plan before you do anything else, and you can see it just expanding out.
The data set that was running in production was much larger than anything we'd tested with, which is how we got to this point to start with. The result was worthy of David Copperfield: the plan had effectively exploded into a Cartesian product.
You might want to watch out for this licorice, salmiak, which is common in some European countries; not everyone likes it a whole lot, okay? It will become clear why I bring it up: salmiak is salty, and this next problem was solved by salting. We had a data set with a very heavy hot spot: a very high frequency of one key, key A. You could see that this workload needed that hot key spread out, so we turned the natural key into a synthetic key.
The workload also used a UDF. The UDF was both deterministic and expensive: unpleasant text processing, heavy calculations on the data set. If you know anything about Spark and UDFs, you know the optimizer treats them as black boxes, so the UDF could end up evaluated either more or fewer times than it appeared in the original query.
The developer's thinking was, "I'll cache the results and it will be great." Right? But it hadn't been thought through very well. The assumption was a really high cache hit rate, that the job would just zip right through. Instead it bogged down and died, because two things were working at cross purposes: to amortize the time spent in an expensive UDF, its results have to actually land in the cache and be reused. But transformations like filter meant the UDF's input was pulled again and again on data the cache was not populated for, which negated the benefit of caching entirely.
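A hedged PySpark sketch of that failure mode (the names and the UDF body are invented): Spark matches cached plans structurally, so if later queries re-derive the expensive column from the raw source instead of reusing the cached frame, the UDF can run all over again.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[1]").getOrCreate()

@udf(StringType())
def expensive_clean(s):
    # Stand-in for the unpleasant, expensive text processing.
    return s.strip().lower()

raw = spark.range(1000).selectExpr("concat(' Item-', id, ' ') AS name")

enriched = raw.withColumn("clean", expensive_clean("name")).cache()
enriched.count()  # cache() is lazy: an action is needed to populate it

# Reuse the *cached* frame so Spark can serve rows from the cache.
hits = enriched.filter("clean LIKE 'item-1%'").count()

# Re-deriving from `raw` with a different lineage may miss the cache
# and re-run the UDF -- the trap described above.
misses = raw.filter("id < 100").withColumn("clean", expensive_clean("name"))
```

The safe pattern is: compute the expensive column once, cache, materialize, and make every downstream query start from the cached DataFrame object.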
So that was the interaction of the Catalyst optimizer with a UDF that is expensive. Here's the good news: you might say, "I've tried to read Spark logs before," but you don't need to be a genius to spot what's wrong with many Spark queries, because Spark will often tell you outright.
For example, Spark warns you when generated code produces a method of more than 64K bytecode. It also flags queries where the data sets that are participating in a join don't appear to really be related to each other, which is a sign of an accidental cross join and something that you should investigate.
Whenever something surprising causes a new data set or plan to be created, why not dig a bit more deeply? Running through Spark Jira is worth it: you're almost never the first person ever to hit a problem, and the issues are often really well documented, with commentary from core developers. One thing you learn there is that the Spark optimizer will try to optimize as much of the query graph as possible, so a plan with a thousand nodes at the end of your query is a red flag.
There could be a whole presentation in just these warnings. Constructs that could lead to a cross join get called out; so do things that can also lead to trouble you don't want running in production, like operations on decimal columns that silently lose precision past a certain number of places. That one's included as an illustration; the message goes on about the JVM, blah, blah, blah, but the point stands.
So messages in your Spark logs can help you optimize your queries. Some problems, though, only show up while running the physical plan of a large application. You may need to give the Spark driver more memory, and you may need to tune the JVM code cache to ensure the JIT compiler can keep up with the generated code. These go into the driver options, on every single Spark driver: how much code cache is available, the reserved code cache size, and flags that print how much of the code cache is free.
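A hedged example of the kind of driver JVM options meant here (the sizes are placeholders, not recommendations; tune them for your workload):

```shell
spark-submit \
  --driver-memory 8g \
  --conf "spark.driver.extraJavaOptions=-XX:ReservedCodeCacheSize=512m -XX:+PrintCodeCache" \
  my_job.py
```

`-XX:ReservedCodeCacheSize` raises the ceiling on JIT-compiled code, and `-XX:+PrintCodeCache` reports code-cache usage so you can see how much is free.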
Especially if you're still living on JDK 8, watch for the case where generated code is greater than 64K per method. When a generated query grows large enough, you hit that hard limit on the JVM; with a plan that's absolutely massive, it is easy to hit. You can see it in the generated-code warnings in your Spark logs. Spark will generally do the right thing and fall back to a non-codegen path, but it costs you.
We also saw a huge plan being analyzed over and over again. When that happens you have to really dig in, and the remedies would be either to cache the intermediate result or to checkpoint it, which truncates the query plan. Checkpointing has an I/O impact, okay, but that impact can be worth it. And if you have hit some type of codegen problem, what you really want to do, though, is look at the generated code in the logs. Spark can log some of the code generated, and there's an option to log the entire thing on error, so you can see which query is the problem.
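The checkpoint remedy can be sketched like this (the path, loop, and column names are mine, for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # illustrative path

df = spark.range(1_000_000).toDF("id")
for i in range(10):
    # Iterative enrichment like this keeps growing the logical plan...
    df = df.withColumn(f"col_{i}", df["id"] * i)

# ...checkpoint materializes the data and truncates the plan, so the
# optimizer no longer re-analyzes the whole history on every action.
df = df.checkpoint()
```

`cache()` avoids the write to the checkpoint directory but does not truncate the plan; choose based on whether re-analysis or recomputation is the bottleneck.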
These warnings are like a carbon monoxide detector: the on-call engineer gets alerted, with a reference to that specific generated class, but it does not halt the application. To reproduce the problem offline, use explain with codegen mode. Our query had seven codegen stages; I cut it down to reduce it to just two, so we can look at a really small one, okay?
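A minimal sketch of that codegen explain mode (the table and query are my own stand-ins):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.range(100).createOrReplaceTempView("t")

# Prints each WholeStageCodegen subtree followed by its generated Java source.
spark.sql("EXPLAIN CODEGEN SELECT id, count(*) FROM t GROUP BY id") \
     .show(truncate=False)
```

On Spark 3.x, `df.explain("codegen")` gives the same output directly from a DataFrame.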
Now you can correlate the individual generated classes back to the output of the Catalyst optimizer section of the plan. You can see that the in-memory table scan and the filter have had their codegen fused together; I've artificially captured the stage boundary here so you can see it. When you see Exchange in the Spark UI, that means a shuffle: moving data across a network. This is not a math operation; it's expensive. What I'm showing you is kind of, almost, how to line this up in your head: stage one ends at the shuffle; the highlighted stage two does the final aggregation, and a single TakeOrderedAndProject produces the answer: the count of tree sites per street. So when a query like this is slow, how do you point the finger at a specific rule?
The Catalyst optimizer is partially rule-based, and you can see what rules are being applied. The rule executor collects metrics, and you can reset the metrics, run the logical optimization for your query, and dump what happened. This is the easiest output to show; I've redacted all the rules that were not applicable here. For each rule you get: how many runs there were, how many runs actually changed the plan, and how much time the optimizer spent in the rule. For codegen, the goal is the smallest code size possible: the size of the code matters, and the optimizer works through the rules trying to shrink it, producing the smallest code possible while leaving room to do further optimization later. When a specific optimization misbehaves, real join-based optimizations for example, something like a single rule in here is often the cause of your bug.
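One way to get those per-rule numbers from PySpark is through the JVM gateway. This is an internal Catalyst API reached via py4j, shown purely as a debugging trick: method availability and names vary across Spark versions, so treat every call here as an assumption to verify against your version.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Internal API: the Catalyst RuleExecutor companion object.
rule_executor = spark.sparkContext._jvm.org.apache.spark.sql.catalyst.rules.RuleExecutor

rule_executor.resetMetrics()                  # zero the counters (Spark 3.x)
spark.range(100).groupBy().count().collect()  # run the query of interest
print(rule_executor.dumpTimeSpent())          # per-rule runs, effective runs, time
```

In the Scala spark-shell the equivalent is `RuleExecutor.resetMetrics()` / `RuleExecutor.dumpTimeSpent()`.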
What's the impact of disabling codegen? Take a look at that first stage again: the operations have not been fused together. Where before the section highlighted in red showed the operations fused, now that whole class of optimizations is off. With codegen, this stage was 489 milliseconds, and the generated code for it was 256 kilobytes.
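Running that comparison yourself is a one-line setting; flip it per session for the experiment, not in production:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Turn whole-stage code generation off for this session only.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.range(100).groupBy().count().explain()  # note: no '*' fused operators now

spark.conf.set("spark.sql.codegen.wholeStage", "true")  # restore the default
```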
The next example is more subjective, and totally unrelated to codegen. What makes this query faster on a data set of this size? Look back in the physical plan: there were 200 shuffle partitions, the default, which is far too many for a data set of this size. After cutting the partitions to only two, the aggregate now takes place almost entirely in memory, with one exchange at the end, and it used far less memory as compared to 4.1 gigs before.
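That tuning is also a single setting (the value 2 matches the example above; the right number depends on your data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# The default of 200 shuffle partitions is sized for large data sets;
# for a small aggregate it mostly creates scheduling overhead.
spark.conf.set("spark.sql.shuffle.partitions", "2")
```

On Spark 3.x, adaptive query execution (`spark.sql.adaptive.enabled`) can coalesce shuffle partitions for you at runtime.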
So there are many reasons why a query could be slow, and the number of partitions involved is a common one; don't just assume that your query is slow because of your logic. Finally, look at some exciting changes ahead, all aimed at more efficient query code. Coming to the Catalyst optimizer: Dynamic Partition Pruning, which prunes partitions based on the other side of the join, and there are going to be new join hints, something that's been a thorn for a while.